Markov Chain Attribution

R package build 2021-04-02

What is Attribution Modeling

Before jumping to the data analysis let’s first set the context for the problem that Markov Chain Attribution modeling addresses.

Many Advertising Channels

When marketers spend money on advertising they strategically allocate budget towards different campaigns, which in turn use a multitude of advertising channels. Some common advertising channels include cable and streaming TV, Programmatic, Social, Paid Search, and Print. How these are defined, and which media partners and platforms are used to spend the planned budget, varies, but the fact that advertisers communicate through a multitude of channels remains.

Goals and KPI’s

Let’s keep in mind that marketers don’t advertise for the sake of advertising. They have particular goals and objectives around which a marketing communications strategy is built. One of the roles of Analytics is to help build a measurement plan that defines a set of Key Performance Indicators (KPI’s) that will serve as a measure of success for the marketing efforts. These KPI’s are chosen depending on the goals and technical capabilities, and tools for measuring them are put in place. For example we may want to generate visits to a website or a webpage, generate leads through form submissions, have users perform certain actions on our website, raise awareness for our brand, or simply drive sales. Whatever the KPI is, a unit of a completed KPI is referred to as a Conversion.

ROI and Attribution

While we are running a campaign and spending our budget, how do we evaluate the degree of importance of the different channels in driving our KPI? Obviously if our goal was to drive sales but, all else being equal, our sales did not increase, then none of the channels worked well. But if we do succeed in driving incremental sales, which channel is working best? In other words, how do we Attribute the incremental sales to different channels and compute our Return on Investment (ROI, sometimes referred to as return on advertising spend (ROAS))? Should we tactically alter our plan because one of the channels is not performing as expected? Answering these questions is one way Analytics adds value to Advertising.

Last-Touch in the Consumer Path

One way to answer this question comes about as a result of digital advertising and consumer level data. Digital advertising can often be cheaper and more easily measured. This trackability allows us to collect data and analyze parts of the Consumer Path that over time lead to a Conversion. A consumer journey exists whether or not it’s tracked, and no one can ever really track the whole journey. What we can track are some of the Media Touchpoints. Often the easiest way to Attribute Conversions is to simply look at the Last-Touch in the Consumer Path. That’s because often that is the only touchpoint we do see. Giving the Last-Touch full credit for driving a Conversion will result in an unfair assessment of our media effectiveness and will not account for the Synergy that a mix of channels aims to create.

library(visNetwork)
library(tidyverse)
nodes <- 
  data.frame(id = 1:4, 
             label = c("Display", "TV", "Search 100%", "Conversion"),
             color = c("grey", "grey", "lightblue", "green"),
             shape = c("square","square","square","square"))
edges = data.frame(from = 2:4, to = 1:3)

visNetwork(nodes, edges, width = "100%") %>% 
  visEdges(arrows = "from") %>% 
  visIgraphLayout(layout = "layout_on_grid")

Thankfully, if an adequate effort to track our media is made, we can generate Path Data and hope to attribute in a more sophisticated way, that is, using Multi-Touch Attribution (MTA). This allows us to better judge the effectiveness of the channels used, and make better investment decisions.

nodes <- 
  data.frame(id = 1:4, 
             label = c("Display 25%", "TV 35%", "Search 40%", "Conversion"),
             color = c("lightblue", "lightblue", "lightblue", "green"),
             shape = c("square","square","square","square"))
edges = data.frame(from = 2:4, to = 1:3)

visNetwork(nodes, edges, width = "100%") %>% 
  visEdges(arrows = "from") %>% 
  visIgraphLayout(layout = "layout_on_grid")

Multi-Touch Attribution

There are two main categories of MTA methods: Rule-Based and Data-Driven. Of course both of these use data, but what distinguishes them is how they assign importance to touch-points along the consumer path. Rule-Based methods heuristically assign weights to touch-points based on their position. You can find more info about these here.
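As a quick illustration (a minimal sketch of my own, not taken from any package), here is how three common position rules translate into per-touchpoint credit weights for a single hypothetical path:

```r
# Position-based credit weights for one hypothetical path.
# Each rule returns a weight per touchpoint; weights sum to 1.
path = c("Display", "TV", "Search")
n = length(path)

first_touch  = c(1, rep(0, n - 1))  # all credit to the first touchpoint
last_touch   = c(rep(0, n - 1), 1)  # all credit to the last touchpoint
linear_touch = rep(1 / n, n)        # equal credit to every touchpoint

rbind(first_touch, last_touch, linear_touch)
```

The heuristic_models() function used at the end of this post applies rules of exactly this kind across a whole data-set of paths.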

Data driven methods take a different approach. Instead of assigning arbitrary rules based on position they establish probabilistic relationships between touchpoints and their role in driving Conversions.

Markov Chain Attribution

Markov Chain Attribution is one of the more popular data-driven methods, and as the name suggests it takes advantage of Markov Chains. Unless you studied Operations Research or Math at university you might not know what these chains are, so let’s do a high-level intro.

Markov Chains

The key concept to remember about Markov Chains is the Transition Matrix. Markov Chains themselves are a conceptual tool for describing a (stochastic) system. All this means is that when there is a random or probabilistic component to a process, Markov Chains might be a good tool to first describe, and then analyze, this process. The process we are concerned with here is Customer Journeys towards Conversions.

The Fundamentals

We can visualize a Markov Chain as a directed network. Each node is a State that a consumer can be in while following a path. Each arrow is a Transition, and with it there is an associated probability. So there is a predetermined set of touch-points you may have, but which one will happen next is uncertain.

library(visNetwork)
library(tidyverse)
library(gt)
nodes <- 
  data.frame(id = 1:6, 
             label = c("Non-Conversion", "Display", "TV", 
                       "Search", "Conversion","Start"),
             color = c("darkred", "grey", "grey", 
                       "grey", "lightblue", "green"))
edges = expand.grid(from = 1:6, to = 1:6)
edges = edges %>% filter(from != to, to != 5, to != 1, from != 6)

visNetwork(nodes, edges, width = "100%") %>% 
  visEdges(arrows = "from") %>% 
  visIgraphLayout(layout = "layout_with_kk")

The conversion and non-conversion nodes have arrows pointing to them from all channels, but no arrows emanating from them. These are called Absorbing States and for us they indicate an end of a journey. When we collect all of the probabilities of transitioning from one state to the next, we organize it in a square Transition Matrix.

M = t(matrix(c(1,    0,   0,   0,  0,
               200,   0, 300, 350, 10,
               160, 350,   0, 300,  8,
               120, 200, 300,   0,  6,
               0,    0,   0,   0,  1), nrow = 5))

M = M/rowSums(M)

colnames(M) = c("Non-Conversion", "Display","TV","Search","Conversion")
row.names(M) = c("Non-Conversion", "Display","TV","Search","Conversion")

as.data.frame(round(M,3), row.names = row.names(M)) %>% 
  rownames_to_column() %>% gt::gt() %>%
  gt::tab_header(title = "Theoretical Transition Matrix")  %>%
  tab_style(style = list(cell_fill(color = "#F1CBC3")),
            locations = cells_body(columns = c(2,6), rows = 2:4)) %>%
  tab_style(style = list(cell_fill(color = "#F6A392")),
            locations = cells_body(columns = 3, rows = 3:4)) %>%
  tab_style(style = list(cell_fill(color = "#F6A392")),
            locations = cells_body(columns = 4, rows = c(2,4))) %>%
  tab_style(style = list(cell_fill(color = "#F6A392")),
            locations = cells_body(columns = 5, rows = 2:3))
Theoretical Transition Matrix

                  Non-Conversion  Display     TV  Search  Conversion
  Non-Conversion           1.000    0.000  0.000   0.000       0.000
  Display                  0.233    0.000  0.349   0.407       0.012
  TV                       0.196    0.428  0.000   0.367       0.010
  Search                   0.192    0.319  0.479   0.000       0.010
  Conversion               0.000    0.000  0.000   0.000       1.000

Above is an example Transition Matrix which I just put together for illustration. To make it realistic I set the conversion probabilities to relatively low values. Later we will use this matrix for Simulation!

An important property to keep in mind is that rows sum up to one, since we are dealing with probabilities. The rows represent the current state in the journey, and columns represent the next state.

rowSums(M)
## Non-Conversion        Display             TV         Search     Conversion 
##              1              1              1              1              1

There is one more important thing to know about Markov Chains, and that is the Markov Property, or memorylessness. This will be TMI for some, but it is really the reason why it is possible to summarize the whole process in one simple Transition Matrix. The property is a simplifying assumption stating that the probability of where you transition to next depends only on where you are now, not on your whole journey history. To be clear, this doesn’t mean that the journey history doesn’t matter. It absolutely does impact attribution results. What it doesn’t impact is the next-step transition probability. This makes analysis significantly simpler from a mathematical point of view.
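A nice way to see the Markov Property at work: because the next step depends only on the current state, multi-step behavior follows from the one-step matrix alone, and the n-step transition probabilities are simply matrix powers of \(M\). The sketch below re-defines the example matrix from above so it runs standalone:

```r
# The Markov Property in action: two-step transition probabilities are
# just M %*% M, no path history required.
M = t(matrix(c(1,    0,   0,   0,  0,
               200,   0, 300, 350, 10,
               160, 350,   0, 300,  8,
               120, 200, 300,   0,  6,
               0,    0,   0,   0,  1), nrow = 5))
M = M / rowSums(M)
states = c("Non-Conversion", "Display", "TV", "Search", "Conversion")
dimnames(M) = list(states, states)

M2 = M %*% M   # where a journey is two steps after each starting state
round(M2["Display", ], 3)
```

Each power of \(M\) is again row-stochastic, so the rows of M2 still sum to one.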

From Chains to Paths (Simulation)

In practice, given a data-set of consumer paths, we may estimate a Transition Matrix. The reverse is also true: given a Transition Matrix, we can simulate a set of consumer paths. In fact, simulation plays an important role in deriving the actual levels of importance for each channel, so it is worth seeing the mechanics of this process.

Below we define a function that will work well with the above sample matrix \(M\). Given this matrix and the maximum number of Markov Chain steps we want to make, num_sim, we do the following:

  1. Randomly select a starting point by sampling from the set of media touchpoints (in practice we would actually estimate the starting-point probabilities from data).
  2. For each step:
    1. Pick the row of probabilities from \(M\) corresponding to the current state/touchpoint.
    2. Take one sample from a Multinomial distribution with probabilities set to that row.
    3. Set the sample as the new current state and repeat.
  3. If we hit a Conversion or Non-Conversion, we end the path.

simulate_path = function(M, num_sim){
  path = vector(mode = "integer", length = num_sim)
  # start at a randomly chosen media touchpoint (rows 2 to nrow(M)-1)
  path[1] = sample(2:(nrow(M)-1), 1)
  for(i in 1:(num_sim-1)){
    p = M[path[i],]                     # transition probabilities from the current state
    sn = which(rmultinom(1, 1, p) == 1) # sample the next state
    path[i+1] = sn
    if(sn == 1 | sn == 5){              # absorbing state reached: end the journey
      break
    }
  }
  return(path[path > 0])                # drop unused slots
}

Simulating One Path

Let’s test out this function. The first example is a non-converter.

set.seed(124)
num_sim = 100
path = simulate_path(M, num_sim)
plot(seq_along(path), path , 
     type = "l", axes=FALSE, ylim = c(1,5),
     main = "Non-Converter", 
     ylab = "Channel Touchpoint",
     xlab = "Touchpoint Number")
points(seq_along(path), path)
axis(side = 2, at = 1:5)
axis(side = 1, at = seq_along(path))

We see that we start out with Display, move to Search, then TV, carry on the journey until step 9 where we drop out without converting.

This is another example but this time we have a converter.

set.seed(007)
num_sim = 100
path = simulate_path(M, num_sim)
plot(seq_along(path), path , 
     type = "l", axes=FALSE, ylim = c(1,5),
     main = "Converter", 
     ylab = "Channel Touchpoint",
     xlab = "Touchpoint Number")
points(seq_along(path), path)
axis(side = 2, at = 1:5)
axis(side = 1, at = seq_along(path))

This time we started out with TV, jumped around between Display and Search and converted on step 8 with Search being our last touchpoint.

Simulating a Full Dataset of Journeys

We can now simulate a whole set of paths simply by repeating the function call simulate_path. With a few more data wrangling operations we have a pathdf. Here I will generate 10,000 paths.

num_sim = 100
num_paths = 10000

paths = purrr::map(1:num_paths, ~simulate_path(M, num_sim))

conversion = 
  purrr::map(paths, ~ data.frame(conversion = 
                                   ifelse(.x[length(.x)] == 5, 
                                          "converter",
                                          "non-converter")
                                 )
             ) %>% bind_rows(.id = "path_num")

pathdf = 
  map(paths, ~data.frame(touchpoint = 1:length(.x), channel = .x)) %>%
  bind_rows(.id = "path_num") %>%
  left_join(conversion) %>%
  left_join(
    data.frame(channel_name = colnames(M), channel = 1:5)
  )

head(pathdf, 10) %>% gt::gt() %>%
  gt::tab_header(title = "Simulated Paths")
Simulated Paths

  path_num  touchpoint  channel  conversion     channel_name
  1         1           4        non-converter  Search
  1         2           2        non-converter  Display
  1         3           1        non-converter  Non-Conversion
  2         1           4        non-converter  Search
  2         2           2        non-converter  Display
  2         3           3        non-converter  TV
  2         4           4        non-converter  Search
  2         5           3        non-converter  TV
  2         6           1        non-converter  Non-Conversion
  3         1           3        non-converter  TV

Let’s now plot all of these paths on top of one another to see how this Markov Chain behaves.

plotly::ggplotly(
pathdf %>%
  ggplot(aes(touchpoint, channel, color = conversion, group = path_num)) + 
  geom_line() +
  labs(x = "Touchpoint Number", y = "Channel Touchpoint") +
  theme_minimal()
)

From the plot we see that lines more often drop down to the bottom, which indicates a non-converter. This is as expected. We can also check the simulated conversion rate; it is in fact just under 5%.

table(conversion$conversion)/nrow(conversion)
## 
##     converter non-converter 
##        0.0462        0.9538
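As a sanity check on the simulation (this step is not required for the attribution workflow), the exact conversion rate can be computed directly from \(M\) with standard absorbing-chain algebra: if \(Q\) holds the channel-to-channel transitions and \(r\) the one-step conversion probabilities, the absorption probabilities into Conversion are \(b = (I - Q)^{-1} r\). Assuming the same uniform start over the three channels as simulate_path:

```r
# Exact conversion probability implied by the transition matrix,
# for comparison with the simulated rate above.
M = t(matrix(c(1,    0,   0,   0,  0,
               200,   0, 300, 350, 10,
               160, 350,   0, 300,  8,
               120, 200, 300,   0,  6,
               0,    0,   0,   0,  1), nrow = 5))
M = M / rowSums(M)

Q = M[2:4, 2:4]            # transitions among Display, TV, Search
r = M[2:4, 5]              # one-step probability of converting
b = solve(diag(3) - Q, r)  # conversion probability from each starting channel
mean(b)                    # uniform start over the three channels
```

This comes out a little under 5%, consistent with the simulated rate.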

From Paths to Chains

Now that we have simulated a set of consumer paths from our Markov Chain \(M\), we can try to use them to estimate the original \(M\) with \(\hat{M}\) and see how well the estimation method works. Here I will introduce the brilliant ChannelAttribution package authored by Davide Altomare. It is open source and hosted on GitLab here. This package contains a function called transition_matrix which takes in a data-set of paths and estimates the Transition Matrix.

Data Prep

The first order of business is to structure the data in the required format. First we need to strip out the final touchpoint in each path, since it represents the conversion/non-conversion event. Then we summarize the data, grouping by path and counting the number of converters and non-converters for each unique path observed.

named_paths = 
  pathdf %>%
  mutate(path_num = as.integer(path_num)) %>% # keep groups in numeric path order
  group_by(path_num) %>%
  group_split() %>%
  map(~pull(.x, channel_name))

path_trim = map(named_paths, ~.x[.x != "Non-Conversion" & .x != "Conversion"])

journeydf = 
  tibble(path = map_chr(path_trim, ~str_c(.x, collapse = " > ")),
         conversion = conversion$conversion)

journeydf = 
  journeydf %>% 
  group_by(path) %>%
  summarise(converters = sum(if_else(conversion == "converter", 1, 0)),
            non_converters = sum(if_else(conversion == "non-converter", 1, 0))) %>%
  arrange(-converters, -non_converters)

head(journeydf, 15) %>% gt::gt() %>%
  gt::tab_header(title = "Simulated Journey Data (Top 15 Converters)")
Simulated Journey Data (Top 15 Converters)

  path                    converters  non_converters
  Display                         41             824
  TV                              31             608
  Search                          23             626
  Search > TV                     17             330
  TV > Display                    15             319
  Search > Display                12             256
  Display > Search                10             237
  Display > TV                    10             232
  Display > Search > TV            8             140
  TV > Search                      7             235
  Search > TV > Display            7             155
  TV > Display > TV                7             110
  TV > Search > Display            7              92
  Display > TV > Display           6             117
  Search > Display > TV            6              67

Here are some quick summary stats about the generated path data.

Table 2

There are many other ways to explore this kind of data. One quick and interesting one is the distribution of path lengths.

path_lengths = map_int(path_trim, ~length(.x))
summary(path_lengths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.588   6.000  43.000
hist(path_lengths, main = "Path Length Distribution")

Transition Matrix Estimation

Finally, now that we have our simulated data-set with a known transition matrix \(M\), we can see how to go from the data back to a transition matrix via estimation, and compute \(\hat{M}\). Here we apply the ChannelAttribution::transition_matrix() function to our data. The function returns a lookup table for the channel ids and a table of transition probabilities.

library(ChannelAttribution)

tM = transition_matrix(journeydf, 
                       var_path = "path", 
                       var_conv = "converters", 
                       var_null = "non_converters")

tM$transition_matrix %>% gt::gt() %>%
  gt::tab_header(title = "Transition Probabilities")
Transition Probabilities

  channel_from  channel_to    transition_probability
  (start)       1                        0.341000000
  (start)       2                        0.326500000
  (start)       3                        0.332500000
  1             (conversion)             0.011514909
  1             (null)                   0.236687966
  1             3                        0.404885517
  1             2                        0.346911608
  2             (conversion)             0.009719222
  2             (null)                   0.196480752
  2             1                        0.431393724
  2             3                        0.362406302
  3             (conversion)             0.008998875
  3             (null)                   0.191159929
  3             2                        0.480711970
  3             1                        0.319129226

We can format the table and output it in a way that makes it easy to compare the estimated and the true transition matrices.

tM = 
  tM$transition_matrix %>%
  left_join(tM$channels %>% 
              mutate(channel_from = as.character(id), from = channel_name) %>%
              select(channel_from, from)) %>%
  left_join(tM$channels %>% 
              mutate(channel_to = as.character(id), to = channel_name) %>%
              select(channel_to, to)) %>%
  mutate(from = if_else(is.na(from), channel_from, from),
         to = if_else(is.na(to), channel_to, to)) %>%
  select(-channel_from, -channel_to) %>%
  pivot_wider(names_from = to, values_from = transition_probability) %>%
  filter(from != "(start)") %>%
  relocate(`(null)`, .before = Display) %>%
  column_to_rownames("from")

tM[is.na(tM)] = 0
as.data.frame(round(tM,3), row.names = row.names(tM)) %>%
  rownames_to_column() %>% gt::gt() %>%
  gt::tab_header(title = "Estimated Transition Matrix") %>%
  tab_style(style = list(cell_fill(color = "#F1CBC3")),
            locations = cells_body(columns = c(2,6), rows = 1:3)) %>%
  tab_style(style = list(cell_fill(color = "#F6A392")),
            locations = cells_body(columns = 3, rows = 2:3)) %>%
  tab_style(style = list(cell_fill(color = "#F6A392")),
            locations = cells_body(columns = 4, rows = c(1,3))) %>%
  tab_style(style = list(cell_fill(color = "#F6A392")),
            locations = cells_body(columns = 5, rows = 1:2))
Estimated Transition Matrix

           (null)  Display     TV  Search  (conversion)
  Display   0.237    0.000  0.347   0.405         0.012
  TV        0.196    0.431  0.000   0.362         0.010
  Search    0.191    0.319  0.481   0.000         0.009
as.data.frame(round(M[-c(1,5),],3), row.names = row.names(M[-c(1,5),])) %>% 
  rownames_to_column() %>% gt::gt() %>%
  gt::tab_header(title = "Original Transition Matrix") %>%
  tab_style(style = list(cell_fill(color = "#F1CBC3")),
            locations = cells_body(columns = c(2,6), rows = 1:3)) %>%
  tab_style(style = list(cell_fill(color = "#F6A392")),
            locations = cells_body(columns = 3, rows = 2:3)) %>%
  tab_style(style = list(cell_fill(color = "#F6A392")),
            locations = cells_body(columns = 4, rows = c(1,3))) %>%
  tab_style(style = list(cell_fill(color = "#F6A392")),
            locations = cells_body(columns = 5, rows = 1:2))
Original Transition Matrix

           Non-Conversion  Display     TV  Search  Conversion
  Display           0.233    0.000  0.349   0.407       0.012
  TV                0.196    0.428  0.000   0.367       0.010
  Search            0.192    0.319  0.479   0.000       0.010

Attribution

So far we introduced a theoretical Markov Chain in the form of a transition matrix \(M\), used it to simulate a data-set of consumer paths, and returned to a transition matrix by estimating \(\hat{M}\). Now it’s time to move on to the actual attribution model!

Removal Effects

Markov Chains seem to be a good model for Consumer Journeys. But how to use them to actually attribute conversions is not immediately clear. The main idea here is to compute what are called Removal Effects. In essence, we reverse the original question of how many conversions a channel brings, and instead ask how many fewer conversions we would have if a channel were omitted from the plan. This quantity is the Removal Effect. If we compute this for each channel in the Markov Chain, then the Relative Removal Effects give the factors by which we scale the total conversion volume.
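To make this concrete, here is a self-contained sketch (my own illustration, not the ChannelAttribution implementation) of removal effects computed analytically from the example matrix \(M\). Removing a channel is modeled here by sending all traffic that would have reached it to Non-Conversion, with a uniform start over the remaining channels:

```r
# Removal effects: how much the conversion probability drops when each
# channel is taken out of the chain.
M = t(matrix(c(1,    0,   0,   0,  0,
               200,   0, 300, 350, 10,
               160, 350,   0, 300,  8,
               120, 200, 300,   0,  6,
               0,    0,   0,   0,  1), nrow = 5))
M = M / rowSums(M)

conv_prob = function(M, start_states){
  Q = M[2:4, 2:4]                # channel-to-channel transitions
  r = M[2:4, 5]                  # one-step conversion probabilities
  b = solve(diag(3) - Q, r)      # absorption probabilities into Conversion
  mean(b[start_states - 1])      # average over the allowed starting channels
}

p_all = conv_prob(M, 2:4)

removal_effects = sapply(2:4, function(ch){
  Mr = M
  Mr[, 1] = Mr[, 1] + Mr[, ch]   # traffic headed to ch drops out instead
  Mr[, ch] = 0                   # the removed channel is never entered
  1 - conv_prob(Mr, setdiff(2:4, ch)) / p_all
})

# Relative removal effects: the factors used to split total conversions
round(removal_effects / sum(removal_effects), 3)
```

Since the three channels in \(M\) were set up with fairly similar transition probabilities, the relative removal effects come out close to one third each.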

Running Attribution

As we just learned, running a Markov attribution model involves estimating a transition matrix and then computing the removal effects. To do this we simply use the markov_model function, which takes the data in the prepared format. By specifying out_more = TRUE we get back the transition matrix and removal effects in addition to the attribution results.

mm_res = 
  markov_model(journeydf, 
             var_path = "path", 
             var_conv = "converters", 
             var_null = "non_converters", 
             out_more = TRUE)
## 
## Number of simulations: 100000 - Convergence reached: 2.65% < 5.00%
## 
## Percentage of simulated paths that successfully end before maximum number of steps (44) is reached: 99.99%

Attribution Results

First let’s take a look at the results. The total number of conversions was 462 (the 4.62% conversion rate across 10,000 paths). The below table shows how these were attributed by our Data-Driven model.

mm_res$result %>% gt::gt()
  channel_name  total_conversions
  Display                153.6905
  TV                     156.9407
  Search                 151.3688

As mentioned before, these values are computed using the removal effects. Here are the actual removal effects and the explicit code that computes the attributed conversions.

sum(journeydf$converters) * mm_res$removal_effects$removal_effects/sum(mm_res$removal_effects$removal_effects)
## [1] 153.6905 156.9407 151.3688

Quick Comparison to Rule-Based Models

heuristic_models(journeydf, 
                 var_path = "path", 
                 var_conv = "converters") %>%
  gt::gt()
  channel_name  first_touch  last_touch  linear_touch
  Display               154         173      160.6948
  TV                    147         153      157.8124
  Search                161         136      143.4928

Note how Last-Touch shifts credit toward Display (173 conversions versus the Markov model’s 153.7) and away from Search (136 versus 151.4), illustrating how single-touch rules can misstate channel effectiveness.